Exploratory Data Analysis
Collisions by Year
As we are currently in the middle of 2023, the data collection for
this year is incomplete and I will limit the data till 2022.
crashes_till_22 <- crashes |>
mutate(year = year(crash_date)) |>
filter(year < 2023)
crashes_by_year <- crashes_till_22 |>
count(year)
crashes_by_year |>
ggplot(aes(x = as.factor(year), y = n)) +
geom_bar(stat="identity", fill = "plum4") +
labs(title = "Collisions by Year (2012-2022)", x = "Year", y = "Number of Crashes") +
scale_y_continuous(label = scales::comma)

There is an increase in the number of collisions until 2016 and then
a decrease after 2018. The most collisions occurred in 2016-2018, with
about 225,000 collisions occurring in each of these years. In 2020, the
number of collisions decreased by almost 50% from the previous year,
possibly owing to stay-at-home orders due to Covid-19.
Fatalities and Injuries from Crashes
injuries_and_fatalities <- crashes_till_22 |>
mutate(year = year(crash_date),
injury = ifelse(number_persons_injured > 0, 1, 0),
death = ifelse(number_persons_killed > 0, 1, 0)) |>
group_by(year) |>
summarize(year_involve_injury = sum(injury, na.rm=T),
year_involve_death = sum(death, na.rm=T)) |>
mutate(total_collisions = crashes_by_year$n,
prop_injured = year_involve_injury / total_collisions,
prop_killed = year_involve_death/ total_collisions) |>
ungroup() |>
pivot_longer(cols = c(prop_injured:prop_killed),
names_to = "severity",
values_to = "prop") |>
mutate(severity = toupper(str_extract(severity, "killed|injured")))
injuries_and_fatalities |>
ggplot(aes(x = as.factor(year), y = prop, fill = severity)) +
geom_bar(stat="identity") +
labs(title = "Collisions Involving Injury/Death per Year (2012-2022)", x = "Year", y = "Percentage of Crashes Involving Injury/Death") +
scale_y_continuous(label = scales::percent) +
geom_text(aes(label = paste0(round(prop*100, 2),'%')), vjust = -0.5, size=3)

While the number of collisions seems to have decreased from
2018-2022, the percentage of collisions involving fatalities and/or
injuries has increased, with almost 40% of collisions in 2022 involving
injuries and about 0.3% of crashes involving deaths.
Collisions by Borough
crashes_till_22 |>
filter(!is.na(borough) & year >= 2017) |>
group_by(year) |>
count(borough) |>
mutate(prop = n / sum(n)) |>
ggplot(aes(x = borough, y = prop)) +
geom_bar(stat="identity", fill = "steelblue") +
facet_wrap(~year) +
labs(title = "Collisions by Borough (2017-2022)", x = "Borough", y = "Percentage of Crashes", caption = "*Perentage out of collisions where borough is known (i.e. not NA)") +
scale_y_continuous(label = scales::percent) +
geom_text(aes(label = paste0(round(prop*100, 0),'%')), size = 3) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Brooklyn and Queens have consistently had the highest percentage of
collisions annually, with Brooklyn making up between 31-35% of
collisions and Queens making up about 27-29% of collisions per year from
2017-2022.
Which areas within these boroughs have the highest number of
collisions?
Brooklyn
crashes_till_22 |>
filter(borough == "BROOKLYN") |>
count(zip_code) |>
drop_na() |>
arrange(desc(n)) |>
head(10) |>
ggplot(aes(y = reorder(as.factor(zip_code), n), x = n)) +
geom_bar(stat = "identity", fill = "steelblue3") +
labs(title = "Highest Collisions by Zipcode - Brooklyn", x = "Count", y = "Zipcode")

The zipcode with the highest number of collisions in Brooklyn is
11207, located in East New York.
Queens
For this, I joined the crashes data to a CSV which I created of
Queens cities based on zipcodes. The information for the zipcodes was
found here.
queens_zips <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/final_project/queens_zipcodes.csv"))
crashes_till_22 |>
filter(borough == "QUEENS") |>
left_join(queens_zips, by = "zip_code") |>
count(city) |>
drop_na() |>
arrange(desc(n)) |>
head(10) |>
ggplot(aes(y = reorder(city, n), x = n)) +
geom_bar(stat = "identity", fill = "steelblue3") +
labs(title = "Highest Collisions by City - Queens", x = "Count", y = "Zipcode")

The highest number of collisions in Queens are in Jamaica and
Flushing.
Main Contributing
Factors
To determine the contributing factors of collisions, I will use the
contributing_factor_vehicle_x columns from the
crashes data set. The data set will have to be
transformed to a long format for this.
contributing_factors_by_crash <- crashes |>
select(collision_id, crash_date, crash_time, borough, c(contributing_factor_vehicle_1:contributing_factor_vehicle_5))
contributing_factors <- contributing_factors_by_crash |>
pivot_longer(col = c(contributing_factor_vehicle_1:contributing_factor_vehicle_5),
names_to = "vehicle",
values_to = "factor") |>
mutate(factor = case_when(str_detect(factor, "Cell Phone") ~ "Cell Phone",
str_detect(factor, "Drugs") ~ "Drugs",
str_detect(factor, "Ill") ~ "Illness",
str_detect(factor, "Uninvolved Vehicle") ~ "Reaction to Uninvolved Vehicle",
TRUE~factor)) |>
filter(!is.na(factor), !factor %in% c("Unspecified", "1", "80"))
contributing_factors |>
count(factor) |>
arrange(desc(n)) |>
mutate(prop = n / sum(n)) |>
head(10) |>
ggplot(aes(x = prop, y = reorder(factor, n))) +
geom_bar(stat="identity", fill = "steelblue") +
scale_x_continuous(label = scales::percent) +
labs(title = "Top 10 Contributing Factors", x = "Percentage of Crashes", y = "Factor")

The main contributing factor in motor vehicle collisions is driver
inattention/distraction, which accounts for about 31% of the causes for
collisions within this data set. Other contributing factors include
failure to yield to right-of-way, following too closely, and other
driver action related factors, as well as driver fatigue.
contributing_factors |>
select(-vehicle) |>
unique() |> # remove replicated crashes (i.e. multiple drivers in one collision with same causal factor)
count(factor) |>
arrange(desc(n)) |>
head(1)
## # A tibble: 1 × 2
## factor n
## <chr> <int>
## 1 Driver Inattention/Distraction 416287
416,287 collisions in this data set were caused by one or more
drivers being inattentive or distracted.
What is the main contributing factor for crashes occurring at each
hour of the day?
Since driver inattention seems to be the most overwhelming factor in
this data set, I will filter that out as the top factor and will check
to see the second most common factor for each hour.
contributing_factors |>
filter(!str_detect(factor, "Inattention/")) |>
mutate(hour = hour(crash_time)) |>
group_by(hour) |>
count(factor) |>
filter(n == max(n)) |>
knitr::kable(col.names = c("Hour", "Factor", "Number of Collisions"))
| 0 |
Following Too Closely |
4003 |
| 1 |
Other Vehicular |
1782 |
| 2 |
Alcohol Involvement |
1590 |
| 3 |
Alcohol Involvement |
1609 |
| 4 |
Alcohol Involvement |
1826 |
| 5 |
Following Too Closely |
1425 |
| 6 |
Following Too Closely |
3006 |
| 7 |
Following Too Closely |
4448 |
| 8 |
Failure to Yield Right-of-Way |
8204 |
| 9 |
Failure to Yield Right-of-Way |
7141 |
| 10 |
Failure to Yield Right-of-Way |
6426 |
| 11 |
Failure to Yield Right-of-Way |
6571 |
| 12 |
Failure to Yield Right-of-Way |
7383 |
| 13 |
Failure to Yield Right-of-Way |
7991 |
| 14 |
Failure to Yield Right-of-Way |
9244 |
| 15 |
Following Too Closely |
9122 |
| 16 |
Failure to Yield Right-of-Way |
10321 |
| 17 |
Failure to Yield Right-of-Way |
10737 |
| 18 |
Failure to Yield Right-of-Way |
9454 |
| 19 |
Failure to Yield Right-of-Way |
7712 |
| 20 |
Failure to Yield Right-of-Way |
6259 |
| 21 |
Failure to Yield Right-of-Way |
5069 |
| 22 |
Failure to Yield Right-of-Way |
4183 |
| 23 |
Failure to Yield Right-of-Way |
3117 |
Between the hours of 2-4 AM, driver inebriation from alcohol
involvement is the main factor contributing to collisions. For all other
hours, following too closely or failure to yield to right of way is the
main contributing factor.
How many collisions are caused by alcohol or drug involvement?
Overall
contributing_factors |>
filter(str_detect(tolower(factor), "drugs|alcohol")) |>
select(-vehicle) |>
unique() |> # remove duplicate crashes
nrow()
## [1] 23486
There are 23,486 incidents of alcohol or drug involvement in car
accidents recorded in this data set.
How many of these accidents resulted in fatalities or serious
injury?
intox_fatalities_injuries <- fatalities_and_injuries |>
pivot_longer(col = c(contributing_factor_vehicle_1:contributing_factor_vehicle_5),
names_to = "vehicle",
values_to = "factor") |>
filter(str_detect(tolower(factor), "drugs|alcohol")) |>
select(-vehicle) |>
unique()
intox_fatalities_injuries |>
nrow()
## [1] 7486
intox_fatalities_injuries |>
mutate(injury = ifelse(number_persons_injured > 0, 1, 0),
fatality = ifelse(number_persons_killed > 0, 1, 0)) |>
summarize(total_injury_inducing = sum(injury),
total_fatality_inducing = sum(fatality))
## # A tibble: 1 × 2
## total_injury_inducing total_fatality_inducing
## <dbl> <dbl>
## 1 7439 117
There are 7,486 fatality/injury inducing accidents caused by
alcohol/drug involvement of one or more drivers, 7,439 of which caused
severe injuries and 117 of which resulting in death.
Per Year (Table)
contributing_factors |>
filter(str_detect(tolower(factor), "drugs|alcohol")) |>
count(year = year(crash_date)) |>
knitr::kable(col.names = c("Year", "Number of Collisions Involving Intoxicated Drivers"))
| 2012 |
856 |
| 2013 |
1760 |
| 2014 |
2261 |
| 2015 |
2201 |
| 2016 |
3033 |
| 2017 |
2848 |
| 2018 |
2577 |
| 2019 |
2389 |
| 2020 |
1741 |
| 2021 |
1924 |
| 2022 |
1953 |
| 2023 |
568 |
Per Year (Graph)
contributing_factors |>
filter(str_detect(tolower(factor), "drugs|alcohol")) |>
select(-vehicle) |>
unique() |>
count(year = as.factor(year(crash_date))) |>
ggplot(aes(x = year, y = n)) +
geom_bar(stat="identity", fill = "steelblue4") +
labs(title = "Number of Collisions Involving Intoxicated* Drivers", x = "Year", y = "Number of Accidents", caption = "*Drugs or Alcohol") +
geom_text(aes(label = n), vjust = -0.5, size=3)

The most recorded collisions from alcohol/drug involvement was in
2016, with 2,925 accidents occurring in relation to intoxicated drivers.
Since Vision Zero was only implemented in 2016, the number of collisions
from 2012-2015 may not be complete so these numbers may not accurately
reflect the true number of crashes caused by alcohol/drug involvement.
Likewise, the data for 2023 is not complete, as the data was accessed
about five months into the year and only accounts for collisions within
those five months. Since the implementation of Vision Zero in 2016, it
seems that collisions resulting from drug or alcohol intoxication has
decreased till 2020 and then started to increase again slightly. The
major dip from 2019 to 2020 may have been caused by stay-at-home
policies due to Covid-19 resulting in less drivers overall on roads
during this time.
Per Hour of Day
contributing_factors |>
filter(str_detect(tolower(factor), "drugs|alcohol")) |>
select(-vehicle) |>
unique() |>
count(hour = as.factor(hour(crash_time))) |>
ggplot(aes(x = hour, y = n)) +
geom_bar(stat="identity", fill = "steelblue4") +
labs(title = "Number of Collisions Involving Intoxicated* Drivers", x = "Hour of Day", y = "Number of Accidents", caption = "*Drugs or Alcohol") +
geom_text(aes(label = n), vjust = -0.5, size=2.5)

The most collisions occur from driver intoxication in the early and
later hours of the day (from midnight to 5 AM and rising steadily from 4
to 11 PM).
Across Boroughs
contributing_factors |>
filter(str_detect(tolower(factor), "drugs|alcohol")) |>
select(-vehicle) |>
filter(!is.na(borough)) |>
unique() |>
count(borough) |>
ggplot(aes(x = borough, y = n)) +
geom_bar(stat="identity", fill = "steelblue4") +
labs(title = "Number of Collisions Involving Intoxicated* Drivers", x = "Borough", y = "Number of Accidents", caption = "*Drugs or Alcohol") +
geom_text(aes(label = n), vjust = -0.5, size=3)

The most collisions resulting from alcohol/drug usage occur in
Brooklyn and Queens.
Collisions by Age
Vision Zero began in 2016, with better methods for collecting and
storing data for collision reports. Therefore, I will limit the data to
crashes from 2016-2022.
drivers_2016_2022 <- drivers |>
mutate(year = year(crash_date)) |>
filter(year >= 2016 & year <= 2022)
hist(drivers_2016_2022$person_age,
main="Distribution of Ages",
xlab = "Driver Age")

The distribution of driver’s ages in the data set is skewed to the
right, with the most collisions being caused by drivers in the 25-40
range.
summary(drivers_2016_2022$person_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 29.00 40.00 41.75 53.00 87.00
The median age of drivers in this data set is 40 years old, and the
middle 50% of data is contained within the 29-53 range.
Collisions by Gender
Overall
drivers_2016_2022 |>
filter(person_sex %in% c("M", "F")) |>
count(person_sex) |>
ggplot(aes(x = n, y = person_sex, fill = person_sex)) +
geom_bar(stat="identity") +
labs(title = "Collisions by Gender (2016-2022)", x = "Number of Collisions", y = "Gender") +
theme(legend.position="none") +
scale_x_continuous(labels = scales::comma)

Overall, there seems to be a greater number of collisions caused by
males than by females.
Per Year
drivers_2016_2022 |>
filter(person_sex %in% c("M", "F")) |>
group_by(year) |>
count(person_sex) |>
ggplot(aes(x = n, y = person_sex, fill = person_sex)) +
geom_bar(stat="identity") +
facet_wrap(~year) +
labs(title = "Collisions by Gender (2016-2022)", x = "Number of Collisions", y = "Gender") +
theme(legend.position="none") +
scale_x_continuous(labels = scales::comma)

Annually, the number of collisions caused by female drivers is less
than half the amount of collisions caused by male drivers.
Collisions by Age and Gender
drivers_2016_2022 |>
filter(person_sex %in% c("M", "F")) |>
ggplot(aes(x = person_age, y = person_sex, fill = person_sex)) +
geom_boxplot() +
labs(title = "Distribution of Ages by Sex", x = "Age", y = "Sex") +
theme(legend.position = "none")

Both plots have a skew to the right, with more collisions being
caused by younger people of both genders. There does seem to be a
relationship between age, gender, and collisions. There seems to be a
higher distribution of ages for males involved in accidents than females
involved in accidents. However, it is hard to tell from the box plots
alone whether this relationship is significant.
males_2016_2022 <- drivers_2016_2022 |>
filter(person_sex == "M")
females_2016_2022 <- drivers_2016_2022 |>
filter(person_sex == "F")
summary(males_2016_2022$person_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 30.00 40.00 42.11 53.00 87.00
summary(females_2016_2022$person_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 29.00 38.00 40.72 51.00 87.00
The median age for men involved in collisions is 40, with 50% of men
involved in collisions being between the ages of 30-53.
The median age for women involved in collisions is 38, with 50% of
women involved in collisions being between the ages of 29-51.
Inference
Based on the exploratory data analysis, I would like to determine the
extent to which alcohol/drug involvement plays a role in the severity of
car crashes, and I would also like to examine if there is a relationship
between age and gender and their combined contribution to the number of
collisions.
Intoxication vs. Collision Severity
First, let’s look at the relationship between driver intoxication and
collision severity.
Question: Is there a significant relationship between
alcohol/drug involvement and serious injuries or deaths in
collisions?
The variables for this analysis are categorical (whether or not there
is a serious injury or death, whether or not there is alcohol
involvement), so I will be using a chi-square test to compare the
observed values for the number of accidents with serious injuries/deaths
for both non-intoxicated and intoxicated drivers against the expected
values if driver intoxication does not play a significant role.
\(H_0\): There is no significant
relationship between alcohol/drug use and serious injuries or deaths in
collisions.
\(H_A\): There is a significant
relationship between alcohol/drug use and serious injuries or deaths in
collisions.
\(\alpha = 0.05\)
Since Vision Zero only started in 2016 and the data for 2023 is not
yet complete, I will filter the data to only collisions between 2016 and
2022. I am also going to include the use of prescription medication
within the realm of drug and alcohol use.
collision_severity <- crashes |>
filter(year(crash_date) >= 2016,
year(crash_date) <= 2022) |>
select(collision_id, number_persons_injured, number_persons_killed) |>
mutate(num_injured_or_killed = number_persons_injured + number_persons_killed)
# subset of collisions involving alcohol/drug use
alcohol_drugs <- contributing_factors |>
filter(year(crash_date) >= 2016,
year(crash_date) <= 2022,
str_detect(tolower(factor), "drugs|alcohol|medication")) |>
select(-vehicle, -factor) |>
unique() |> # remove redundant collisions
left_join(collision_severity, by = "collision_id") |>
mutate(alc_drugs = "Alc/Drugs")
# subset of collisions not involving alcohol/drug use
not_alcohol_drugs <- contributing_factors |>
filter(year(crash_date) >= 2016,
year(crash_date) <= 2022) |>
subset(!collision_id %in% alcohol_drugs$collision_id) |> # any collision not related to any intoxicated driver
select(-vehicle, -factor) |>
unique() |>
left_join(collision_severity, by = "collision_id") |>
mutate(alc_drugs = "No Alc/Drugs")
# all collisions
alc_drug_involvement <- rbind(alcohol_drugs, not_alcohol_drugs)
alc_drug_involvement |>
count(alc_drugs)
## # A tibble: 2 × 2
## alc_drugs n
## <chr> <int>
## 1 Alc/Drugs 17398
## 2 No Alc/Drugs 914768
glimpse(alc_drug_involvement)
## Rows: 932,166
## Columns: 8
## $ collision_id <dbl> 3363357, 3363421, 3363487, 3363489, 3363516, 33…
## $ crash_date <date> 2016-01-01, 2016-01-01, 2016-01-01, 2016-01-01…
## $ crash_time <time> 11:30:00, 04:35:00, 06:30:00, 02:54:00, 03:40:…
## $ borough <chr> "MANHATTAN", "BRONX", "BROOKLYN", "BROOKLYN", "…
## $ number_persons_injured <dbl> 2, 0, 0, 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 1,…
## $ number_persons_killed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ num_injured_or_killed <dbl> 2, 0, 0, 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 1,…
## $ alc_drugs <chr> "Alc/Drugs", "Alc/Drugs", "Alc/Drugs", "Alc/Dru…
alc_drug_involvement <- alc_drug_involvement |>
mutate(injured_or_killed = ifelse(num_injured_or_killed > 0, "Injuries/Death", "No Injuries/Death"))
serious_injuries_intox_or_not <- table(alc_drug_involvement$injured_or_killed, alc_drug_involvement$alc_drugs)
(chi_sq <- chisq.test(serious_injuries_intox_or_not))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: serious_injuries_intox_or_not
## X-squared = 477.11, df = 1, p-value < 2.2e-16
The p-value here is less than 2.2e-16 which is much less than 0.05 so
there is a significant relationship between driver intoxication and
serious injury/death in collisions.
chi_sq$expected
##
## Alc/Drugs No Alc/Drugs
## Injuries/Death 4302.163 226199.8
## No Injuries/Death 13095.837 688555.2
Assumptions of \(\chi^2\):
As stated above, the variables are categorical. Additionally, each
observation in the alc_drug_involvement data set is
independent of each other, as each records a specific collision where
alcohol or drugs were either involved or not involved. The cells of the
contingency table are mutually exclusive, and the expected value for
100% of the cells is greater than 5. Therefore, all assumptions of
chi-square are met. (Expected values must be greater than 5 in at least
80% of the cells, and cannot go below 1).
Intoxication and Time of Day
Question: Is there a significant relationship between
collisions caused by driver intoxication and time of day?
\(H_0\): There is no significant
relationship between alcohol/drug involvement in collisions and time of
day.
\(H_A\): There is a significant
relationship between alcohol/drug involvement in collisions and time of
day.
\(\alpha = 0.05\)
time_of_day <- alc_drug_involvement |>
mutate(time_of_day = case_when(hour(crash_time) >= 5 & hour(crash_time) < 12 ~ "Morning",
hour(crash_time) >= 12 & hour(crash_time) < 17 ~ "Afternoon",
TRUE ~ "Evening"))
intox_vs_time_of_day <- table(time_of_day$time_of_day, time_of_day$alc_drugs)
(chi_sq2 <- chisq.test(intox_vs_time_of_day))
##
## Pearson's Chi-squared test
##
## data: intox_vs_time_of_day
## X-squared = 6679.8, df = 2, p-value < 2.2e-16
The p-value is 2.2e-16 which is very small and less than 0.05, so
there is a relationship between collisions caused by driver intoxication
and time of day.
chi_sq2$expected
##
## Alc/Drugs No Alc/Drugs
## Afternoon 5479.021 288081.0
## Evening 7140.346 375431.7
## Morning 4778.633 251255.4
Accidents involving driver intoxication are more likely to occur in
the evening (between 5 PM and 5 AM) than they are during the Morning or
Afternoon.
Age and Gender vs. Collisions
Now, let’s look at the relationship between age and gender and their
joint contribution to the likelihood of collisions.
Question: Is there a significant difference in the ages of
males and females involved in motor vehicle collisions?
\(H_0: \mu_1 = \mu_2\) (There is no
significant difference in the ages of males and females involved in
motor vehicle collisions)
\(H_A: \mu_1 \neq \mu_2\) (There is
a significant difference in the ages of males and females involved in
motor vehicle collisions)
Where \(\mu_1\) is the mean age of
males involved in collisions and \(\mu_2\) is the mean age of females involved
in collisions.
\(\alpha = 0.05\)
drivers_2016_2022 |>
filter(person_sex %in% c("M", "F")) |>
group_by(year = year(crash_date), person_sex) |>
summarize(median = median(person_age),
IQR = IQR(person_age),
num_drivers = n())
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 5
## # Groups: year [7]
## year person_sex median IQR num_drivers
## <dbl> <chr> <dbl> <dbl> <int>
## 1 2016 F 38 22 75237
## 2 2016 M 41 23 206770
## 3 2017 F 38 22 92827
## 4 2017 M 41 23 257642
## 5 2018 F 39 22 92607
## 6 2018 M 41 24 259393
## 7 2019 F 39 22 83058
## 8 2019 M 41 24 234352
## 9 2020 F 37 22 35820
## 10 2020 M 39 23 107255
## 11 2021 F 36 21 33068
## 12 2021 M 38 23 96215
## 13 2022 F 37 21 31188
## 14 2022 M 39 24 89300
To make the data more manageable, let’s look at just collisions for
2022.
drivers_2022 <- drivers_2016_2022 |>
filter(person_sex %in% c("M", "F"), year == 2022)
drivers_2022 |>
ggplot(aes(x = person_age, y = person_sex, fill = person_sex)) +
geom_boxplot() +
labs(title = "Distribution of Ages by Sex in 2022", x = "Age", y = "Sex") +
theme(legend.position = "none")

drivers_2022 |>
group_by(person_sex) |>
summarize(mean_age = mean(person_age))
## # A tibble: 2 × 2
## person_sex mean_age
## <chr> <dbl>
## 1 F 40.2
## 2 M 41.5
There is an observed difference. But, is this difference
statistically significant?
males <- drivers_2022 |>
filter(person_sex == "M")
females <- drivers_2022 |>
filter(person_sex == "F")
Assumptions of t-test:
drivers_2022 |>
count(person_sex)
## person_sex n
## 1 F 31188
## 2 M 89300
sd(males$person_age)
## [1] 14.74327
sd(females$person_age)
## [1] 14.25776
qqnorm(males$person_age, main = "Normal Q-Q Plot - Males")
qqline(males$person_age)

qqnorm(females$person_age, main = "Normal Q-Q Plot - Females")
qqline(females$person_age)

Each observation in the data set represents a single driver,
independent of one another. From the QQ-plot, we can see that the
distribution of ages for males and females closely follows the straight
line in the middle (about ages 30-70 for males and about 30-80 for
females), but there are problems towards the tails of the distributions.
Since this is a very large sample size, and the data is relatively
normal, we can continue with the analysis. (A note on this: I also
performed the analysis using the log of person_age to
normalize the data more and saw no difference in the results vs. the
results of the un-transformed data. For interpretability purposes, I
chose to use the un-transformed data.) The variance for both groups is
about the same. Since the data was collected by the NYPD and records
drivers actually involved in accidents, it is reasonable to assume that
the data is random and does not have biases. Therefore, assumptions of
the t-test are met.
(t_test <- t.test(males$person_age, females$person_age, alternative = "two.sided", var.equal = T))
##
## Two Sample t-test
##
## data: males$person_age and females$person_age
## t = 12.944, df = 120486, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.056131 1.433058
## sample estimates:
## mean of x mean of y
## 41.46699 40.22239
The p-value for this t-test is very small (< 2.2e-16), so
therefore we reject the null hypothesis and say that there is a
significant difference in the mean ages between males and females
involved in accidents. The mean age for men is about 41.4 and the mean
age for women is about 40.2. Based on the 95% confidence interval, we
are 95% confident that the true difference in the mean ages for men and
women involved in car accidents is between 1.06 and 1.43.
We can also visualize this significance using the null
distribution.
obs_diff <- drivers_2022 |>
specify(person_age~person_sex) |>
calculate(stat = "diff in means", order = c("M", "F"))
null_dist <- drivers_2022 |>
specify(person_age~person_sex) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "diff in means", order = c("M", "F"))
visualize(null_dist) + shade_p_value(obs_stat = obs_diff, direction = "two_sided")

The observed difference is way off from what we would expect to see
if there was not a statistically significant relationship. Therefore, we
can conclude that there is a significant difference in the mean ages of
men and women who get involved in collisions. As such, there is a
significant relationship between the ages of men and women in relation
to being involved in collisions. Men who get into car accidents tend to
be slightly older than women who get into car accidents.
Cohen’s d:
(mean(males$person_age) - mean(females$person_age)) / sqrt((sd(males$person_age) + sd(females$person_age))/ 2)
## [1] 0.3268407
Conclusions
Based on the analysis of car crash data obtained from the New York
City Police Department, this project identified driver
inattention/distraction as the primary contributing factor for
collisions, with 416,287 collisions recorded as being caused by one or
more drivers who were distracted or not paying attention. Aside from
this, alcohol abuse was found to be the main contributing factor for car
crashes between the hours of 2-4 AM. Using a chi-square test of
independence, the data provided evidence that there are significantly
more accidents resulting in serious injury or death when there is
alcohol or drugs involved than when there is no alcohol or drug
involvement.
The analysis also revealed a significant relationship between gender
and age with respect to collisions. The data also showed a higher number
of collisions caused by men than by women, with the median age of men
involved in collisions (40) being higher than the median age of women
involved in collisions (38). Using a two-sample t-test and visualizing
with the null distribution, this difference was shown to be significant,
with the average age of men involved in collisions being significantly
higher than the average age of women involved in accidents.
According to the data, Brooklyn and Queens had the largest percentage
of collisions in the data set. The zipcode with the highest number of
collisions in Brooklyn was 11207, located in East New York. The highest
number of collisions in Queens occurred in Jamaica and Flushing. This
type of analysis, along with determining streets or cross roads with
high densities of collisions occurring, can allow city planners and
policymakers to determine areas where traffic safety measures need to be
implemented, such as increasing the number of traffic signals or
improving road signs, etc.
While there is more analysis that can be done using this data, this
study provides a brief insight into some of the underlying factors which
contribute to collisions and the severity of collisions. This also
highlights the need for future research to be able to design
evidence-based interventions and strategies to reduce the number of
motor vehicle collisions, serious injuries, and deaths in New York
City.